Add per-metric error handling and fix other issues #230
Conversation
Would it be possible to merge this in time for the September crawl?
Hi @GJFR. In a discussion with @VictorLeP on Slack, he pointed me to this PR. I have updated a PR for a robots.txt extraction to capture some data for next year's Web Almanac, and have finalized it here with test cases included: #236. The current robots.txt code in the well-known extraction script has two issues:

I could update my robots.txt script to produce the following output. I have already addressed #1 in my PR. I think I can make #2 better, but probably not perfect.

I wanted to get the Security Team's point of view on whether we should consolidate, or edit the well-known metric to correct some things.
Hi @jroakes. Thanks for the heads-up. As for the issues you mentioned:
Thanks @GJFR, I think you make a really good point for pt. 2 that I had not considered. FWIW, this is what I had tested, in case it is helpful. It solves pt. 1, and my solution for pt. 2 was to 1) include

This also includes regex that should match:

But it doesn't clean the file first to handle commented values (I did this in robots_txt.js):

Here is test output for LinkedIn. I am not that great with JS, so I am sharing this only as a POC example that can be improved.
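To make the comment-cleaning point concrete, here is a hypothetical sketch (not the elided snippet from this comment, nor the actual robots_txt.js code) that strips `#` comments from each line before matching directives case-insensitively:

```js
// Hypothetical sketch: clean robots.txt lines before matching directives,
// so commented-out values (e.g. "# Disallow: /private") are not captured.
function parseRobotsTxt(content) {
  const directives = [];
  for (const rawLine of content.split('\n')) {
    // Drop everything from '#' to the end of the line, then trim whitespace.
    const line = rawLine.replace(/#.*$/, '').trim();
    // Match "<field>: <value>" case-insensitively ("User-agent", "Disallow", ...).
    const match = line.match(/^([a-z-]+)\s*:\s*(.*)$/i);
    if (match) {
      directives.push({ field: match[1].toLowerCase(), value: match[2].trim() });
    }
  }
  return directives;
}
```

With input like `'User-Agent: *\n# Disallow: /x\nDisallow: /admin'`, this would yield only the `user-agent` directive and the un-commented `disallow` directive, skipping the commented-out line.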
Unfortunately, the `well-known.js` custom metric did not do so well in the previous crawls due to errors. If one `parseResponse` call fails, the whole metric fails. To solve this, I have separated error handling for each call.

The metric would also fail when the site's policy wouldn't allow loading another page (e.g. due to CORS). This is now also handled for each `parseResponse` call. Since it is not possible to resolve the underlying cause of these errors, I just indicate that an error occurred while fetching a URL.

Lastly, I also fixed a capitalisation issue for string comparisons in `/robots.txt`.

I also see a lot of 'The user aborted a request.' errors in our crawl data. Does this have to do with `fetchWithTimeout`?

WPT tests (focus on the errors in previous crawls)
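For illustration, a minimal sketch of what per-call error handling along these lines could look like; the `fetchWithTimeout` implementation and the `parseResponse` signature below are assumptions for the sketch, not the actual code in this PR:

```js
// Hypothetical sketch of per-call error handling; names and signatures are
// assumptions, not the code from this PR.

// Fetch a URL but abort after a timeout. An aborted fetch rejects with an
// AbortError, which Chromium reports as 'The user aborted a request.'.
function fetchWithTimeout(url, timeoutMs = 5000) {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  return fetch(url, { signal: controller.signal }).finally(() => clearTimeout(timer));
}

// Parse one resource in isolation: if fetching or parsing fails (network
// error, CORS block, timeout), only this resource reports an error and the
// rest of the metric still completes.
async function parseResponse(url, parser) {
  try {
    const response = await fetchWithTimeout(url);
    return await parser(response);
  } catch (e) {
    // The underlying cause usually can't be recovered from within the page,
    // so just indicate that an error occurred while fetching this URL.
    return { error: `Error occurred while fetching ${url}` };
  }
}
```

With this shape, one failing resource would no longer prevent, say, the `/robots.txt` result from being collected. And since aborting a timed-out fetch surfaces as 'The user aborted a request.', that error pattern in the crawl data would be consistent with `fetchWithTimeout` firing.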